Speech-controlled animation system

ABSTRACT

Methods, systems and apparatuses directed toward an authoring tool that gives users the ability to make high-quality, speech-driven animation in which the animated character speaks in the user's voice. Embodiments of the present invention allow the animation to be sent as a message over the Internet or used as a set of instructions for various applications including Internet chat rooms. According to one embodiment, the user chooses a character and a scene from a menu, then speaks into the computer's microphone to generate a personalized message. Embodiments of the present invention use voice-recognition technology to match the audio input to the appropriate animated mouth shapes, creating a professional-looking 2D or 3D animated scene with lip-synced audio characteristics.

FIELD OF THE INVENTION

[0001] The present invention relates to animation systems and, more particularly, to a method and apparatus for generating an animated sequence having synchronized visual and audio components.

BACKGROUND OF THE INVENTION

[0002] Existing technology related to Internet communication systems includes such applications as pre-animated greetings, avatars, e-mail, web-based audio delivery and video conferencing. Originally, e-mail messages were sent through the Internet as text files. However, the commercial demand for more visual stimulus and advances in compression technology soon allowed graphics, in the form of short pre-animated messages with embedded audio, to be made available to the consumer. For example, software packages such as Microsoft Greetings Workshop allow a user to assemble a message with pre-existing graphics, short animations and sound. These are multimedia greeting cards that can be sent over the Internet, but without the voice or gestures of the original sender.

[0003] Existing software in the area of video conferencing allows audio and video communication through the Internet. Connectix, Sony Funmail and Zap technologies have developed products that allow a video image with sound to be sent over the Internet. Video e-mail can be sent as an executable file that can be opened by the receiver of the message without the original software. However, video conferencing requires both sender and receiver to have the appropriate hardware and software. Although video e-mail and conferencing can be useful for business applications, many consumers have reservations about seeing their own image on the screen and prefer a more controllable form of communication.

[0004] In the area of prior art Internet messaging software, a variety of systems have been created. Hijinx Masquerade software allows text to be converted into synthetic voices and animated pictures that speak the voices. The system is designed to use Internet Relay Chat (IRC) technology. The software interface is complicated and requires the user to train the system to match text and image. The result is a very choppy animated image with mouth shapes accompanied by a synthetic computer voice. The software is limited by its inability to relay the actual voice of its user in sync with a smooth animation. In addition, a Mitsubishi technology research group has developed a voice puppet, which allows an animation of a static image file to be driven by speech in the following manner. The software constructs a model using a limited set of the speaker's facial gestures, and applies that model to any 2D or 3D face, using any text, mapping the movements onto the new features. In order to learn to mimic someone's facial gestures, the software needs several minutes of video of the speaker, which it analyzes, maps and stylizes. This software allows a computer to analyze and stylize video images, but does not directly link a user's voice to animation for communication purposes. Geppetto software also aids professional animators in creating facial animation. The software helps professionals generate lip-sync and facial control of 3D computer characters for 3D games, real-time performance and network applications. The system inputs the movement of a live model into the computer using motion analysis and MIDI devices. Scanning and motion analysis hardware capture a face and gestures in real time and then record the information into a computer for animation of a 3D model.

[0005] Prior art software for Internet communication has also produced “avatars”, which are simple characters that form the visual embodiment of a person in cyberspace and are used as communication and sales tools on the Internet. These animations are controlled by real-time commands, allowing the user to interact with others on the Internet. Microsoft's V-Chat software offers an avatar pack, which includes downloadable characters and backgrounds, and which can be customized by the user with a character editor. The animated character can be represented in 3D or in 2D comic-strip-style graphics with speech bubbles. It uses the Internet Relay Chat (IRC) protocol and can accommodate private or group chats. The user is required to type the message on a keyboard and, if desired, choose an expression from a menu. Accordingly, while chatting the user must make a conscious effort to link the text with the appropriate character expression, since the system does not automatically perform this operation. In addition, the animated characters do not function with lip-synced dialogue generated by the user.

[0006] A number of techniques and systems exist for synchronizing the mouth movements of an animated character to a spoken sound track. These systems, however, are mainly oriented to the entertainment industry, since their operation generally requires much technical sophistication to ultimately produce the animated sequence. For example, U.S. Pat. No. 4,360,229 discloses a system where a recorded sound track is encoded into a sequence of phoneme codes. This sequence of phoneme codes is analyzed to produce a sequence of visual images of lip movements corresponding to the sound track. These visual images can then be overlaid onto existing image frames to yield an animated sequence. Similarly, U.S. Pat. No. 4,913,539 teaches a system that constructs a synchronized animation based upon a recorded sound track. The system disclosed therein uses linear prediction techniques, instead of phoneme recognition devices, to code the sound track. This system, however, requires that the user “train” the system by inputting so-called “training utterances” into the system, which compares the resulting signals to the recorded sound track and generates a phonetic sequence.

[0007] Furthermore, speech-driven animation software has been developed to aid in the laborious task of matching specific mouth shapes to each phoneme in a spoken dialogue. LipSync Talkit and Talk Master Pro work as plug-ins for professional 3D animation programs such as 3D Studio Max and Lightwave 3D. These systems take audio files of dialogue, link them to phonemes and morph the 3D speech animation based on facial bone templates created by the animator. The animation team then assembles the remaining animation. These software plug-ins, however, require other professional developer software to implement their functionality for complete character design. In addition, they do not function as self-contained programs for the purpose of creating speech-driven animations and sending these animations as messages through the Internet.

[0008] The user of prior art speech-driven animation software generally must have an extensive background in animation and 3D modeling. In light of the foregoing, a need exists for an easy-to-use method and system for generating an animated sequence having mouth movements synchronized to a spoken sound track inputted by a user. The present invention substantially fulfills this need and provides a tool for automated animation of a character without requiring prior knowledge of animation techniques from the end user.

SUMMARY OF THE INVENTION

[0009] The present invention provides methods, systems and apparatuses directed toward an authoring tool that gives users the ability to make high-quality, speech-driven animation in which the animated character speaks in the user's voice. Embodiments of the present invention allow the animation to be sent as a message over the Internet or used as a set of instructions for various applications including Internet chat rooms. According to one embodiment, the user chooses a character and a scene from a menu, then speaks into the computer's microphone to generate a personalized message. Embodiments of the present invention use voice-recognition technology to match the audio input to the appropriate animated mouth shapes, creating a professional-looking 2D or 3D animated scene with lip-synced audio characteristics.

[0010] The present invention, in one embodiment, creates personalized animations on the fly that closely resemble the high quality of hand-finished products. For instance, one embodiment of the present invention recognizes obvious volume changes and adjusts the mouth size of the selected character to the loudness or softness of the user's voice. In another embodiment, while the character is speaking, the program initiates an algorithm that mimics common human gestures and reflexes, such as gesturing at an appropriate word or blinking in a natural way. In one embodiment, the user can also add gestures, facial expressions, and body movements to enhance both the natural look of the character and the meaning of the message. Embodiments of the present invention also include modular action sequences, such as running, turning, and jumping, that the user can link together and insert into the animation. The present invention allows several levels of personalization, from the simple addition of voice and message to control over the image itself. More computer-savvy users can scan in their own images and superimpose a ready-made mouth over their picture. The software can also accept user-created input from standard art and animation programs. More advanced audio controls incorporate pitch-shifting audio technology, allowing the sender to match their voice to a selected character's gender, age and size.

[0011] The present invention combines these elements to produce a variety of communication and animation files. These include a deliverable e-mail message with synchronized video and audio components that a receiver of the message can open without the original program, an instruction set for real-time chat room communications, and animation files for web, personal animation, computer game play, video production, training and education applications.

[0012] In one aspect, the present invention provides a method for generating an animated sequence having synchronized visual and audio characteristics. The method comprises (a) inputting audio data; (b) detecting a phonetic code sequence in the audio data; (c) generating an event sequence using the phonetic code sequence; and (d) sampling the event sequence. According to one embodiment, the method further comprises (e) constructing an animation frame based on the sampling step (d); and (f) repeating steps (d)-(e) a desired number of times to create an animation sequence.
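For orientation, steps (a) through (f) can be read as a simple processing pipeline. The following Python sketch is purely illustrative; the stub functions and the tiny phoneme-to-mouth-shape table are hypothetical placeholders standing in for the modules described in the Detailed Description, not the claimed implementation.

    # Hypothetical end-to-end sketch of method steps (a)-(f); the helper
    # functions below are toy stand-ins, not the invention's actual modules.
    PHONEME_TO_MOUTH = {'sil': 'm2', 'k': 'm3', 'a': 'm11', 't': 'm5'}

    def detect_phonetic_codes(duration_ms, interval_ms=10):
        # (b) stub: a real recognizer would return one phoneme per interval
        return [('a', t) for t in range(0, duration_ms, interval_ms)]

    def build_event_sequence(codes):
        # (c) stub: map each detected phoneme to a mouth-shape identifier
        return [(t, PHONEME_TO_MOUTH.get(p, 'm2')) for p, t in codes]

    def sample_events(events, t_ms):
        # (d) use the most recent event at or before the sample time
        current = events[0][1]
        for et, mouth in events:
            if et <= t_ms:
                current = mouth
        return current

    def generate_animation(duration_ms, frame_interval_ms=1000 / 24):
        codes = detect_phonetic_codes(duration_ms)          # step (b)
        events = build_event_sequence(codes)                # step (c)
        frames, t = [], 0.0
        while t < duration_ms:                              # steps (d)-(f)
            mouth = sample_events(events, t)
            frames.append((round(t, 1), mouth))             # (e) placeholder frame
            t += frame_interval_ms
        return frames

    print(generate_animation(200)[:3])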

[0013] In another aspect, the present invention provides an apparatus for generating an animated sequence having synchronized visual and audio characteristics. According to this aspect, the apparatus comprises a mouth shape database, an image frame database, an audio input module, an event sequencing module, a time control module, and an animation compositing module. According to the invention, the audio input module includes a phonetic code recognition module that generates a phonetic code sequence from audio data. The event sequencing module is operably connected to the audio input module and generates an event sequence based on a phonetic code sequence. The time control module is operably connected to the event sequencing module and includes a sampling module, which samples the event sequence. The animation compositing module is operably connected to the sampling module and the mouth shape and image frame databases. According to the invention, the animation compositing module is responsive to the time control module to receive an event sequence value, retrieve a mouth shape from the mouth shape database and an image frame from the image frame database, and composite the mouth shape onto the image frame.

DESCRIPTION OF THE DRAWINGS

[0014] FIG. 1 is a functional block diagram illustrating one embodiment of the apparatus of the present invention.

[0015] FIG. 2 is a flow chart setting forth a method according to the present invention.

[0016] FIG. 3 is a flow chart showing a method for filtering a phonetic code sequence according to the present invention.

[0017] FIG. 4 is a flow chart illustrating a method for generating an event sequence according to the present invention.

[0018] FIG. 5 is a flow chart setting forth a method for constructing an animation sequence according to the present invention.

[0019] FIG. 6 is a flow chart diagram providing a method for use in real-time playback.

[0020] FIG. 7 is a functional block diagram illustrating application of the present invention to a computer network.

[0021] FIG. 8 provides four time lines illustrating, for didactic purposes, the sequencing and sampling steps according to one embodiment of the present invention.

[0022] FIGS. 9A-9R illustrate mouth shapes, each associated with a phoneme or set of phonemes.

[0023] FIGS. 10A, 10B, and 10C: FIG. 10A is an image of the head of an animated character; FIG. 10B is an enlarged portion of FIG. 10A and illustrates the registration pixels used in certain embodiments of the present invention; and FIG. 10C is an image of the head of an animated character over which a mouth shape from one of FIGS. 9A-9R is composited to create an animation frame.

DETAILED DESCRIPTION OF THE INVENTION

[0024] FIG. 1 illustrates an apparatus according to one embodiment of the present invention. As FIG. 1 shows, the apparatus, according to one embodiment, comprises audio module 31 operably connected to microphone 33, event sequencing module 40, time controller/sampling module 42, and animation compositing and display module 44. The above-described modules may be implemented in hardware, software, or a combination of both. In one embodiment, the above-described modules are implemented in software stored in and executed by a general-purpose computer, such as a Win-32 based platform, a Unix-based platform, a Motorola/Apple OS-based platform, or any other suitable platform. In another embodiment, the above-described modules are implemented in a special-purpose computing device.

[0025] A. Audio Module

[0026] According to one embodiment of the present invention, audio module 31 includes components for recording audio data and detecting a phonetic code sequence in the recorded audio data. As FIG. 1 shows, audio module 31 includes phonetic code recognition module 36 and phonetic code filter 38. According to this embodiment, phonetic code recognition module 36 detects a phonetic code sequence in the audio data, while phonetic code filter 38 filters the detected phonetic code sequence.

[0027] 1. Recording Audio Data and Phonetic Code Recognition

[0028] According to the invention, audio data is inputted into the apparatus of the present invention (FIG. 2, step 102). In one embodiment, a user speaks into microphone 33 to input audio data. Audio module 31 records the audio data transduced by microphone 33 and, in one embodiment, stores the recorded audio data in digital form, such as in WAV or MP3 file formats. Other suitable formats include, but are not limited to, RAM, AIFF, VOX, AU, SMP, SAM, AAC, and VQF.

[0029] According to the invention, phonetic code recognition module 36 analyzes and detects a phonetic code sequence in the audio data (FIG. 2, step 104). In one embodiment, phonetic code recognition module 36 detects a sequence of phonemes in the audio data. In one such embodiment, phonetic code recognition module 36 detects a phoneme at a predetermined sampling or time interval. The time or sampling interval at which the audio data is analyzed can be any suitable time interval. In one embodiment, as time line A of FIG. 8 shows, a phoneme is detected in the audio data at 10 millisecond (ms) intervals.

[0030] In another embodiment, phonetic code recognition module 36 generates a phonetic code sequence comprising a set of phoneme probability values for each time interval. According to one such embodiment, phonetic code recognition module 36 generates, for each time interval, a list of all phonemes recognized by module 36 and a phoneme probability value indicating the likelihood that the corresponding phoneme is the actual phoneme recorded during the time interval. In one embodiment, the phoneme having the highest probability value is used for that time point. In another embodiment, and as discussed in Section A.2. below, these phonetic code probabilities are averaged over an averaging interval. According to this embodiment, the phoneme having the highest probability value over the averaging interval is used as the phoneme for the averaging interval.

[0031] Suitable phonetic code recognition engines for use in the present invention include the BaBel ASR version 1.4 speaker-independent speech recognition system based on hybrid HMM/ANN technology (Hidden Markov Models and Artificial Neural Networks) from BaBel Technologies, Inc., Boulevard Dolez, 33, B-7000 Mons, Belgium. Another suitable device is sold under the trademark SPEECH LAB, obtainable from Heuristics, Inc. Additionally, yet another suitable phoneme recognition engine is sold by Entropic, Inc. (recently acquired by Microsoft, Inc.). Of course, almost any available phoneme recognition engine can be used in the present invention.

[0032] In one embodiment, the volume level of the audio data is detected and recorded. In one form, the volume level is detected and recorded at the same sampling rate as the phonetic code recognition. In one embodiment, this sequence of volume levels is used in connection with the phonetic code sequence to adjust the size and/or configuration of the mouth shapes during the animation. For example and in one embodiment, an “O” sound detected at a low decibel level may be mapped to a small “O” mouth shape, while a louder “O” sound will be mapped to a larger “O” mouth shape. (See discussion below.)
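A minimal sketch of how a per-interval volume level might be computed alongside phoneme recognition is given below, assuming 16-bit PCM samples, a 16 kHz sample rate and a 10 ms analysis interval; the function name and parameters are illustrative only.

    import numpy as np

    # Hypothetical sketch: compute one volume value (RMS) per 10 ms analysis
    # interval of 16-bit PCM audio, mirroring the phoneme sampling rate.
    def volume_levels(samples, sample_rate=16000, interval_ms=10):
        step = int(sample_rate * interval_ms / 1000)
        levels = []
        for i in range(0, len(samples) - step + 1, step):
            window = samples[i:i + step].astype(np.float64)
            levels.append(float(np.sqrt(np.mean(window ** 2))))  # RMS volume
        return levels

    # toy usage: half a second of synthetic audio
    audio = (np.sin(np.linspace(0, 2000, 8000)) * 3000).astype(np.int16)
    print(len(volume_levels(audio)), "volume values")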

[0033] 2. Filtering the Phonetic Code Sequence

[0034] In one embodiment, the phonetic code sequence is filtered. Any suitable algorithm for filtering the phonetic code sequence can be used.

[0035] FIG. 3 illustrates a filtering method for use in the present invention. In one embodiment, filtering is accomplished, as alluded to above, by averaging phoneme probability values over an averaging interval and selecting the phoneme having the highest average phoneme probability value. Time line B of FIG. 8 illustrates the time points (Ref. Nos. 1-8), relative to time line A, in the phonetic code sequence after it has been filtered. In the embodiment shown, the averaging interval comprises 4 time intervals, totaling 40 milliseconds of the unfiltered phonetic code sequence. Accordingly, the filtered phonetic code sequence, according to one embodiment, includes a phonetic code value for every 40 ms. Of course, any suitable averaging interval can be used.

[0036] More specifically and in one embodiment, the apparatus initializes the variables used in the averaging algorithm (FIG. 3, steps 202 and 204). As used in FIG. 3, T represents the time point in the audio data; Tl represents the time or sampling interval between phonetic code sequences; and P is the number of recognized phonemes. As FIG. 3 indicates, starting at the first time point in the phonetic code sequence (T=0), the respective phoneme probability value (PhoneProb_(i)) for each recognized phoneme is added to an accumulator (X_(i)) (FIG. 3, step 206) over an averaging interval (see FIG. 3, steps 208 and 210). At the end of each averaging interval, the average phoneme probability for each recognized phoneme (AvgPhoneProb_(i)) is calculated (FIG. 3, step 212). In one embodiment, the phoneme having the highest probability value is used as the phoneme for that time point in the filtered event sequence (FIG. 3, step 216). The averaging variables, X_(i) and Tl, are reset and the averaging process is repeated for the duration of the phonetic code sequence (FIG. 3, steps 214, 216 and 204).
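A compact sketch of this averaging filter, assuming the recognizer emits a dictionary of per-phoneme probabilities for each 10 ms interval and a 40 ms (four-interval) averaging window, might look as follows; the variable names differ from FIG. 3 and are illustrative only.

    # Hypothetical sketch of the FIG. 3 averaging filter: phoneme probability
    # values are accumulated over an averaging interval (here, 4 of the 10 ms
    # analysis intervals = 40 ms) and the highest-scoring phoneme is kept.
    def filter_phonetic_codes(prob_sequence, window=4):
        filtered = []
        for start in range(0, len(prob_sequence), window):
            chunk = prob_sequence[start:start + window]
            totals = {}
            for probs in chunk:                      # accumulate per phoneme
                for phoneme, p in probs.items():
                    totals[phoneme] = totals.get(phoneme, 0.0) + p
            # average and select the phoneme with the highest mean probability
            best = max(totals, key=lambda ph: totals[ph] / len(chunk))
            filtered.append(best)
        return filtered

    # toy usage: each entry holds the recognizer's per-phoneme probabilities
    probs = [{'a': 0.7, 'k': 0.2}, {'a': 0.6, 'k': 0.3},
             {'k': 0.8, 'a': 0.1}, {'a': 0.9, 'k': 0.05}]
    print(filter_phonetic_codes(probs))   # -> ['a']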

[0037] In another embodiment, phonetic code sequences are filtered to eliminate spurious phonetic code values. For example and in one embodiment, the phonetic codes detected over an averaging interval are compared. A set of rules or conditions is applied to the sequence to filter out apparently spurious values. According to one embodiment, if a particular phonetic code occurs only once over the averaging interval, it is filtered out of the phonetic code sequence.
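A sketch of such a rule, under the assumption that a "spurious" value simply means a code that appears only once within the averaging window, is shown below.

    # Hypothetical sketch of the rule-based filter: a phonetic code that
    # appears only once within an averaging window is treated as spurious.
    def drop_spurious(codes, window=4):
        cleaned = []
        for start in range(0, len(codes), window):
            chunk = codes[start:start + window]
            cleaned.extend(p for p in chunk if chunk.count(p) > 1)
        return cleaned

    print(drop_spurious(['a', 'a', 'k', 'a', 'O', 'O', 'O', 'i']))
    # -> ['a', 'a', 'a', 'O', 'O', 'O']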

[0038] B. Event Sequencing Module

[0039] Event sequencing module 40, in one embodiment, generates an event sequence based in part on the phonetic code sequence detected (and, in some embodiments, filtered) by audio module 31 (FIG. 2, step 106). In one embodiment, event sequencing module 40 applies a set of predetermined animation rules to the phonetic code sequence to create an event sequence that synchronizes the mouth shape animation with the audio data. In the embodiment shown in FIG. 1, event sequencing module 40 includes event sequencing database 41 that stores a set of animation rules. In one form, the animation rules comprise mouth shapes or sequences of mouth shapes, each associated with one or more phonetic codes. In another embodiment, the animation rules further comprise mouth shapes or sequences of mouth shapes associated with a plurality of consecutive phonetic codes. In yet another embodiment, the animation rules further comprise mouth shapes or sequences of mouth shapes associated with phonetic transitions. In one embodiment, event sequencing module constructs an event sequence having a floating time scale, in that the time interval between events is not uniform.

[0040] In one embodiment, event sequencing module builds a sequence of mouth shape identifiers by applying a set of animation rules to the phonetic code sequence. As discussed more fully below, the mouth shape identifiers point to files storing the various mouth shapes to be added to the animation in synchronization with the audio data. In addition, as discussed in section C. below, this sequence of mouth shape identifiers is subsequently sampled to construct an animated sequence (either in real-time or non-real-time modes). In one embodiment, the sequence of mouth shapes and corresponding volume data is sampled to construct an animated sequence.

[0041] In one embodiment using phonemes, each phoneme has associated therewith a particular mouth shape or sequence of mouth shapes. In another embodiment involving cartoon animation, for example, phonemes having similar mouth shapes are grouped together and associated with one event (mouth shape) or sequence of events (mouth shapes). In one embodiment, animated mouth shapes are stored in mouth shape database 46, each in a file having a mouth shape identifier.

TABLE 1
Phoneme                  Example           Mouth Shape ID
sil                      <Silence>         m2
a                        apple             m11
ay                       aim               m11
A                        art               m11
&                        abut              m11
uU                       shut              m11
@                        all               m11
ee                       easy              m10
E                        ever              m10
r>                       urge, her         m13
I                        ivy               m11
i                        ill               m6
O                        over              m7
OU                       ouch              m11
OI                       joy               m7
U                        wood              m13
u                        boot              m12
y                        yellow            m5
b                        bed, rib          m4
ch                       chop, itch        m3
d                        dock, sod         m5
f                        fan, off          m9
g                        go, big           m3
h                        hat               m3
j                        job, lodge        m5
k                        kick, call        m3
l                        loss, pool        m8
m                        map, dim          m4
n                        no, own           m5
N                        gong              m5
P                        pop, lip          m4
r                        rob, car          m13
s                        sun, less         m5
SH                       shy, fish         m13
th                       this, either      m5
t                        tie, cat          m5
T                        thin, with        m5
v                        vivid             m9
w                        we, away          m13
z                        zebra, raise      m5
Z                        mirage, vision    m3
dd (flapped allophone)   ladder            m5

[0042] Table 1 provides an illustrative set of phonemes and mouth shapes or sequences of mouth shapes associated with each phoneme. The mouth shape identifiers listed in Table 1 correspond to the mouth shapes of FIGS. 9A-9R. As Table 1 and FIGS. 9A-9R show, several phonemes are represented by the same mouth shape or sequence of mouth shapes. In addition, as Table 1 and FIGS. 9A-9R illustrate, certain phonemes, in some embodiments, are associated with a smaller mouth shape (e.g., m6a, m7a, m11a, etc.) and a larger mouth shape (m6b, m7b, m11b, etc.). (See FIGS. 9E and 9F.) In one embodiment, the smaller and larger mouth shapes are used as an animation sequence. In one embodiment, this sequence provides a smoother transition between mouth shapes and, therefore, a more realistic animation.

[0043] The set of associations in Table 1, however, is only one of myriad possibilities. The number of mouth shapes used to represent the various phonemes is primarily determined by the animator and the level of detail desired. So, for example, the word “cat” involves the “k”, “a”, and “t” phonemes, consecutively. According to the embodiment shown in Table 1, therefore, event sequencing module 40 will insert the m3, m11a, m11b, and m5 mouth shape identifiers into the event sequence at the appropriate time points. In addition, the number of phonemes recognized by the apparatus of the present invention is also a factor determined by the phoneme recognition engine used and the desired properties of the system.

[0044] FIG. 4 illustrates a method for generating an event sequence according to one embodiment of the present invention. Time line C of FIG. 8 shows an event sequence constructed from the phonetic code sequence of time line B. Beginning at the first time point (time line B, Ref. No. 1) in the phonetic code sequence (FIG. 4, step 302), each phonetic code (Phoneme(T)) is analyzed to construct an event sequence. More specifically, if an animated sequence is associated with Phoneme(T) (FIG. 4, step 304), then event sequencing module 40 retrieves the sequence of mouth shape identifiers associated with the phoneme value from sequencing database 41 (FIG. 4, step 312) and inserts them into the event sequence. Otherwise, event sequencing module 40 retrieves a mouth shape identifier corresponding to the current phoneme (FIG. 4, step 306) and inserts it into the event sequence (step 308). In one embodiment, event sequencing module also inserts volume level data into the event sequence. As discussed more fully below, this volume level data can be used to scale the mouth shapes to represent changes in volume level of the audio data.

[0045] In one embodiment, event sequencing module 40 scans for certain recognized transitions between phonetic codes. If a recognized phonetic code transition is detected, events (mouth shapes) are added to the event sequence as appropriate. More particularly and in one embodiment, event sequencing module 40 compares adjacent phonemes (FIG. 4, steps 310 and 316). If the particular pair of phonemes is a recognized transition event (step 316), event sequencing module retrieves the sequence of events (step 318) from event sequencing database 41 and inserts them into the event sequence. In other embodiments, event sequencing module 40 scans the phonetic code sequence for recognized groups of three or more phonemes and inserts events into the event sequence associated with that group in event sequencing database 41. In one embodiment, this loop is repeated for the duration of the phonetic code sequence (FIG. 4, steps 322 and 324).

[0046] FIG. 8, time lines B and C, illustrates, for didactic purposes, a hypothetical event sequence generated by event sequencing module 40. In the embodiment shown in Table 1, a particular phoneme may have one or a plurality of events (mouth shapes) associated with it. In addition, a pair of adjacent/consecutive phonemes may also have a plurality of events associated with it. P_(i), in FIG. 4, represents a particular phoneme value in the phonetic code sequence, Phoneme(T). As time line C shows, P1 corresponds to one mouth shape identifier (E1), which is inserted into the event sequence. Event sequencing module 40 steps to the next time interval in the phonetic code sequence (FIG. 4, step 310) and compares P1 and P2 to determine whether these adjacent phonemes correspond to a transition event (step 316). As time line C indicates, one event (E2) is associated with the transition event and is inserted into the event sequence (steps 318 and 320). Event sequencing module 40 then inserts event E3, corresponding to P2, at the next time interval (FIG. 4, steps 306 and 308). As E5 and E6 illustrate, a plurality of events associated with a transition event can be inserted into the event sequence. The events (E5 and E6) can be spaced in time according to the desired animation effect. In addition, events E8 and E9 are an example of a sequence associated with one phoneme value, P6. Similarly, events E10 and E11 are associated with a single phoneme value (P7); however, the sequence is inserted after the corresponding time point (6).
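The loop of FIG. 4 might be sketched roughly as follows. The mouth-shape and transition tables here are tiny hypothetical excerpts (the entry for the "m"-to-"O" pair is invented for illustration), and the spacing of inserted events is arbitrary rather than prescribed by the invention.

    # Hypothetical sketch of the FIG. 4 loop: walk the filtered phoneme
    # sequence, insert transition sequences where an adjacent pair has a
    # rule, and insert the mouth shape(s) for each individual phoneme.
    MOUTH_FOR = {'sil': ['m2'], 'k': ['m3'], 'a': ['m11a', 'm11b'], 't': ['m5']}
    TRANSITION_FOR = {('m', 'O'): ['m4', 'm7a']}   # invented transition rule

    def build_event_sequence(phonemes, interval_ms=40):
        events = []                                  # (time_ms, mouth_shape_id)
        prev = None
        for i, phoneme in enumerate(phonemes):
            t = i * interval_ms
            # transition events, if any, go just ahead of the phoneme's own shape(s)
            for j, shape in enumerate(TRANSITION_FOR.get((prev, phoneme), [])):
                events.append((t - interval_ms / 2 + j * 5, shape))
            for j, shape in enumerate(MOUTH_FOR.get(phoneme, ['m2'])):
                events.append((t + j * (interval_ms / 4), shape))
            prev = phoneme
        return events

    print(build_event_sequence(['sil', 'k', 'a', 't']))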

[0047] A sequence of mouth shapes, either associated with a phonetic transition or with an individual phonetic code, can be used to achieve a variety of purposes. For example, the individual phoneme for the “r” sound can be associated with a sequence of mouth shapes to provide the effect of vibrating lips or to provide some variation in the mouth shape, such that it does not appear static or fixed. In another embodiment, a pair of adjacent “r” phonemes can be associated with a sequence of mouth shapes that provides the effect of vibrating lips. In another example, a pair of adjacent “@” phonemes can be replaced with the m11a-m11b sequence. In yet another example, the “OI” phoneme can be represented by a sequence of mouth shapes inserted over the duration of the presence of the phoneme in the audio data to better mimic reality. In still another example, an animation rule can prescribe certain mouth shape sequences upon the occurrence of a certain phoneme. For example, if the “@” phoneme is encountered, an animation rule can cause event sequencing module 40 to add the smaller mouth shape, m11a, and then the larger mouth shape identifier, m11b, in the next two time points of the event sequence, such that the event sequence comprises m11a-m11b-m11b. Furthermore, the interval between mouth shape identifiers also depends on the characteristics of the desired animation.

[0048] 1. Phonetic Transitions

[0049] As to phonetic transitions, animated mouth shape sequences can be used to reduce or eliminate the sharp mouth shape transitions that would otherwise occur and, therefore, make a more fluid animation. For example, a sequence of mouth shapes may be inserted at the transition from an “m” phoneme to an “O” phoneme. According to one form, a sequence of mouth shapes, from closed, to slightly open, to half open, to mostly open, to fully open, is inserted into the event sequence. In other embodiments, the animated mouth shape sequence associated with a phonetic transition may comprise a number of time points in the phonetic code sequence. In such an embodiment, event sequencing module 40 is configured to insert the sequence at the appropriate location and overwrite existing mouth shapes in the event sequence. The particular animated sequences and phonetic associations, however, depend on the effect desired by the animator. Countless variations can be employed.

[0050] According to the mouth shape identification system employed in the embodiment described above, the use of m#a identifies a mouth shape that is smaller than m#b. In the embodiment shown, only vowel phonemes include two mouth shapes. In one embodiment, a set of animation rules determines whether the larger or smaller mouth shape appears first. The following describes an illustrative set of animation rules:

[0051] a. Consonant-to-Vowel Transition

[0052] According to one embodiment of such animation rules, if a vowel phoneme follows a consonant, the smaller mouth shape of the vowel phoneme precedes the larger mouth shape. For example, the word “cat” results in a sequence of mouth shapes including m3, m11a, m11b and m5. Similarly, in embodiments involving more than two mouth shapes for a given vowel phoneme, the mouth shapes appear in ascending-size order.

[0053] b. Vowel-to-Consonant Transition

[0054] According to another animation rule, if a vowel precedes a consonant, any event sequence will end on the larger mouth shape. For example, the word “it” results in the sequence m6a-m6b-m5.

[0055] c. Silence-to-Vowel Transition

[0056] According to a third animation rule, silence followed by a vowel requires that the smaller mouth shape, if any, be inserted into the event sequence before the larger mouth shape. For example, silence followed by “at” results in the sequence m2-m11a-m11b-m5.

[0057] d. Vowel-to-Silence Transition

[0058] In contrast, a vowel-to-silence transition, according to another rule, results in the opposite configuration of mouth shapes. For example, “no” followed by silence results in the sequence m5-m7a-m7b-m7a-m2.

[0059] e. Vowel-to-Vowel Transition

[0060] Pursuant to another animation rule, if a vowel-to-vowel transition is encountered, the larger mouth shape corresponding to the second vowel phoneme is used in the event sequence. For example, applying this rule and the vowel-to-consonant transition rule above to “boyish” results in the sequence m4, m7a, m7b, m6b, m13.
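Taken together, the illustrative rules (a) through (e) above could be expressed roughly as in the sketch below. The vowel and consonant tables are small excerpts of Table 1 chosen to reproduce the worked examples, and the function names are hypothetical.

    # Hypothetical sketch of the illustrative animation rules (a)-(e),
    # using the m#a (smaller) / m#b (larger) naming convention.
    VOWELS = {'a': ('m11a', 'm11b'), 'i': ('m6a', 'm6b'),
              'O': ('m7a', 'm7b'), 'OI': ('m7a', 'm7b')}
    CONSONANTS = {'k': 'm3', 't': 'm5', 'n': 'm5', 'b': 'm4', 'SH': 'm13'}
    SILENCE = 'm2'

    def shapes_for(prev, current):
        if current in VOWELS:
            small, large = VOWELS[current]
            if prev in VOWELS:
                return [large]                 # e. vowel-to-vowel: larger shape only
            return [small, large]              # a./c. consonant- or silence-to-vowel
        shapes = []
        if prev in VOWELS and current == 'sil':
            shapes.append(VOWELS[prev][0])     # d. vowel-to-silence: smaller shape again
        shapes.append(CONSONANTS.get(current, SILENCE))
        return shapes                          # b. a vowel before a consonant already
                                               #    ends on its larger shape

    def word_to_shapes(phonemes):
        shapes, prev = [], 'sil'
        for p in phonemes + ['sil']:
            shapes.extend(shapes_for(prev, p))
            prev = p
        return shapes

    print(word_to_shapes(['k', 'a', 't']))        # m3, m11a, m11b, m5, m2
    print(word_to_shapes(['n', 'O']))             # m5, m7a, m7b, m7a, m2
    print(word_to_shapes(['b', 'OI', 'i', 'SH'])) # m4, m7a, m7b, m6b, m13, m2

The three printed sequences match the “cat”, “no” and “boyish” examples worked through in the rules above, with a trailing m2 for the silence appended after each word.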

[0061] C. Time Controller/Sampling Module

[0062] Time controller/sampling module 42 samples an event sequence generated by event sequencing module 40 and, in one embodiment, drives animation compositing and display module 44. In one configuration, an embodiment of time controller/sampling module 42 periodically samples the event sequence at uniform sampling intervals to create an animated sequence. In one embodiment, time controller/sampling module 42 samples the event sequence according to a set of predetermined rules that adjust the resulting animation based in part upon the time constraints imposed (see discussion below).

[0063] FIG. 8, time line D, illustrates the operation of an embodiment of time controller/sampling module 42 that, in one mode, periodically samples the event sequence at uniform intervals. According to the invention, the event sequence can be sampled at any rate necessary to achieve the desired animation. For example, cartoon sequences are typically animated at between 12 and 24 frames per second. Accordingly, the sampling rate or sampling interval (each a function of the other) can be adjusted according to the frame rate desired for the animation. For example, a cartoon animation sequence displayed at 24 frames per second requires a sampling interval of 1/24th of a second (41.7 ms). However, in order to achieve smaller file sizes for animations intended to be transmitted over a computer network, a slower sampling rate, resulting in fewer frames per second, may be used. In addition, normal video output displays 30 frames per second. Accordingly, the event sequence, for use in such an embodiment, will be sampled every 1/30th of a second (33.3 ms). Moreover, since the sampling of the event sequence does not depend on the sampling rate of the original audio data, the present invention allows the same event sequence to be used in animations having different frame rates, without having to re-record the audio data, regenerate a phonetic code sequence, or recreate an event sequence.

[0064] 1. Sampling/Animation Rules

[0065] According to one embodiment, time controller/sampling module 42 samples the event sequence according to a set of predetermined rules. In one embodiment, sampling module 42 maps to the most recent mouth shape identifier in the event sequence (see FIG. 8, time line D). In another embodiment, sampling module 42 maps to the event having the closest time value in the event sequence.

[0066] In yet another embodiment, the sampling rules adjust the resulting animation based upon the time constraints imposed by the sampling rate. For example, and as time lines C and D of FIG. 8 demonstrate, the particular sampling rate used may cause time controller/sampling module 42 to omit or skip over certain mouth shapes in the event sequence. Accordingly and in one embodiment, time controller/sampling module 42 is configured to sample the event sequence according to a set of predetermined rules. The predetermined rules depend on the configuration and effect desired by the animator. Below are illustrative examples of sampling rules for use in an embodiment of the present invention.
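An illustrative sketch of uniform-interval sampling using the most-recent-event rule is given below; the toy data are invented. Note how, at 24 frames per second, one of the event-sequence mouth shapes is skipped in the output, which is exactly the situation the sampling rules of the following subsections are meant to handle.

    # Hypothetical sketch: sample the event sequence at uniform frame
    # intervals (e.g., 24 fps -> 41.7 ms) and map each frame time to the
    # most recent event at or before it.
    def sample_event_sequence(events, duration_ms, fps=24):
        interval = 1000.0 / fps
        samples, t, idx, current = [], 0.0, 0, events[0][1]
        while t < duration_ms:
            while idx < len(events) and events[idx][0] <= t:
                current = events[idx][1]              # most recent mouth shape
                idx += 1
            samples.append((round(t, 1), current))
            t += interval
        return samples

    events = [(0, 'm3'), (40, 'm11a'), (60, 'm11b'), (80, 'm5')]
    print(sample_event_sequence(events, 160))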

[0067] a. Consonant to Vowel

[0068] According to one embodiment, one sampling rule requires that all possible sizes of mouth shapes for a vowel be sampled and used in constructing an animation, if possible. For example, the word “cat”, according to Table 1, requires an m3-m11a-m11b-m5 mouth shape identifier sequence. If the particular sampling interval, however, does not allow for both the m11a and m11b mouth shapes to be used, a conditional rule requires that the m11b mouth shape, which is larger than the m11a mouth shape, be used after the appearance of a consonant.

[0069] b. Vowel to Consonant

[0070] According to the set of animation rules described above, the word “it” is represented by the m6a-m6b-m5 mouth shape sequence. According to an associated sampling rule, if the sampling rate does not allow for both the m6a and m6b mouth shapes, the larger mouth shape, m6b, is sampled and used to construct an animation frame. Therefore, the word “it” would be represented by the m6b and m5 mouth shape sequence.

[0071] c. Silence to Vowel/Vowel to Silence

[0072] According to the animation rules described above, silence followed by the word “at” results in an event sequence comprising m2, m11a, m11b and m5. As above, all mouth shapes are to be used when possible. If, however, the sampling rate prevents the use of all mouth shapes, a sampling rule causes the smaller vowel mouth shape, m11a, to be omitted. The resulting sequence that is sampled comprises m2, m11b, and m5. Similarly, the same sampling rule can be applied to a vowel-to-silence transition. For example, the word “no” followed by silence results in the mouth shape sequence comprising m5, m7b, m7a. Application of the sampling rule, when required, results in the sequence m5 to m7a.

[0073] D. Animation Compositing and Display Module

[0074] Animation compositing and display module 44, in one embodiment, receives an event sampled by time controller/sampling module 42 and constructs an animation frame responsive to the sampled event. As FIG. 1 shows, animation compositing and display module 44 includes mouth shape database 46 and image frame database 48. In one embodiment, users can select from among a plurality of animated characters and backgrounds from which to generate an animated sequence. According to such an embodiment, mouth shape database 46 includes mouth shapes for one to a plurality of animated characters. For each animated character, mouth shape database 46 includes mouth shape identifiers pointing to files storing mouth shapes corresponding to phonetic codes or groups of phonetic codes. Image frame database 48 stores at least one sequence of image frames, each including a background region and a head without a mouth. As discussed below, mouth shapes are added to image frames to create animation frames.

[0075] In one embodiment, the sampled event includes a mouth shape identifier and volume data. In one form of this embodiment, animation compositing and display module 44 scales the size of the mouth shape according to the volume data before compositing it on the head of the character. For example, a higher volume phoneme, in one embodiment, corresponds to a larger mouth shape, while a lower volume phoneme corresponds to a smaller mouth shape.

[0076] 1. Constructing an Animation Sequence

[0077] In one embodiment, animation compositing and display module 44 stores each resulting frame in an animation sequence. Specifically, FIG. 5 illustrates a method for sampling an event sequence and, frame by frame, constructing an animation sequence. In one embodiment, the sequence of image frames is intended to be displayed at 24 frames per second. Accordingly, sampling module 42 will sample the event sequence at 41.7 ms intervals. As to each frame, time controller/sampling module 42 samples the event sequence and, in one embodiment, passes the event to animation compositing module 44 (FIG. 5, step 402). Animation compositing and display module 44 retrieves the first image frame in the sequence stored in image frame database 48 (FIG. 5, step 406). Animation compositing module 44 then retrieves the mouth shape corresponding to the sampled event and adds it to the image frame to create an animation frame (FIG. 5, step 408). The resulting animation frame is then stored in an animation sequence (FIG. 5, step 410). In one embodiment, this animation frame compositing loop is repeated for the duration of the event sequence (FIG. 5, steps 412 and 414) to create an animation sequence. The resulting animation sequence and the audio data can then be assembled into a multimedia file, such as a QuickTime or AVI movie. Other suitable file formats include, but are not limited to, Macromedia Flash, Shockwave, and Things (see www.thingworld.com).
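The frame-construction loop of FIG. 5 might be sketched as follows using the Pillow imaging library; the mouth images, image frames, paste position and sampled events here are toy stand-ins generated in code so that the example runs on its own, rather than art drawn from the mouth shape and image frame databases.

    from PIL import Image

    # Hypothetical sketch of the FIG. 5 loop: for each sampled event, take
    # the next image frame, composite the corresponding mouth shape onto it,
    # and append the result to the animation sequence.
    mouth_db = {'m3': Image.new('RGBA', (40, 20), (200, 0, 0, 255)),
                'm11b': Image.new('RGBA', (50, 30), (0, 0, 200, 255))}
    image_frames = [Image.new('RGBA', (320, 240), (255, 255, 255, 255))
                    for _ in range(4)]
    sampled_events = [(0.0, 'm3'), (41.7, 'm11b'), (83.3, 'm11b'), (125.0, 'm3')]

    animation = []
    for i, (t, mouth_id) in enumerate(sampled_events):
        frame = image_frames[i % len(image_frames)].copy()   # next image frame
        mouth = mouth_db[mouth_id]
        frame.paste(mouth, (140, 150), mouth)                # composite mouth
        animation.append(frame)
    print(len(animation), "animation frames built")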

[0078] 2. Registration of the Mouth Shape in the Image Frame

[0079] One embodiment uses registration marks to add the mouth shape to the image frame. (See FIG. 10.) The use of these registration marks allows the head to move relative to the frame as the animation sequence progresses, while still allowing for proper placement of the mouth shape. In one embodiment of the present invention, each frame in an animation sequence is a digital image. One form of this embodiment uses the Portable Network Graphics (PNG) format, which allows for storage of images of arbitrary size with 8 bits of red, green, blue and alpha (or transparency) information per pixel. However, any suitable image file format that supports transparency can be used in this embodiment of the invention. Other suitable formats include, but are not limited to, TIFF and TGA.

[0080] In one embodiment, the individual frames of the animation sequence are created by an animator using traditional techniques, resulting in a set of image frames stored in digital form. According to the invention, the head of the animated character is drawn separately from the mouth. As discussed above, the appropriate mouth shape is added later to complete an animation frame. In one embodiment of the present invention, the head of the animated character is drawn separately from the background. In another embodiment, the head is drawn together with the background.

[0081] In one embodiment, each head frame has two pixels on it set to mark the left and right edges of the mouth. In one embodiment, the pixels are located proximally to the corners of the mouth. Of course, the location of the registration pixels is not crucial to the invention. According to this embodiment, the mouth shape also has two registration pixels corresponding to the pixels in the head frame. In one embodiment, the registration pixels in the mouth shape do not necessarily correspond to the corners of the mouth, since the location of the mouth corners depends on the particular mouth shape. For example, a smiling mouth shape has uplifted corners, while a frowning mouth shape has downturned corners. Still further, the mouth shape corresponding to an “o” phoneme may have corners that lie inside of the registration pixels.

[0082] In one embodiment, the registration pixels are set by choosing a pixel color that does not appear in any head or mouth image and, using that color, drawing the alignment pixels in the appropriate locations and storing either the image frame or the mouth shape. In one embodiment, when the image frame is subsequently loaded into memory, the image file is scanned to retrieve the x- and y-coordinates of the registration pixels (e.g., by looking for pixels having a predetermined color reserved for the registration pixels). The x- and y-coordinates corresponding to the registration pixels are stored. In one embodiment, the registration pixels are then overwritten with the color of the adjacent pixels to hide them from view. In another embodiment, the x- and y-coordinates of the registration pixels can be stored separately in another file, or as tagged data alongside the image frame. In one embodiment, the x- and y-coordinates of the registration pixels are stored as data in the object representing the bitmap of the mouth shape or image frame.

[0083] When the image frame and mouth shape are combined, the registration pixels on the mouth shape are mapped onto the registration pixels on the head. In one embodiment, a mathematical algorithm is used to discover the 3×2 linear transformation that maps the mouth onto the head. The transformation matrix allows the mouth image to be rotated, scaled up or down, and moved until the registration pixels align. In one embodiment, the transformation matrix is applied to the mouth shape using high-quality image re-sampling, which minimizes artifacts such as aliasing. In one embodiment, the Intel Image Processing Software Library is used to perform the image transformation and compositing.
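One way to derive such a transform from the two registration-pixel pairs is sketched below: treating each point as a complex number, the similarity (rotation, uniform scale and translation) that maps the mouth's registration pixels onto the head's is solved directly, giving the 3×2 transformation written here as a 2×3 matrix. The coordinates are invented for illustration, and the resulting matrix could then be handed to any affine warping routine (e.g., cv2.warpAffine) for the actual re-sampling and compositing.

    import numpy as np

    # Hypothetical sketch: derive the rotate/scale/translate transform that
    # maps the mouth shape's two registration pixels onto the head's two
    # registration pixels, expressed as a 2x3 affine matrix.
    def registration_transform(mouth_pts, head_pts):
        (x0, y0), (x1, y1) = mouth_pts
        (u0, v0), (u1, v1) = head_pts
        # treat each point as a complex number; the similarity is z -> a*z + b
        z0, z1 = complex(x0, y0), complex(x1, y1)
        w0, w1 = complex(u0, v0), complex(u1, v1)
        a = (w1 - w0) / (z1 - z0)
        b = w0 - a * z0
        return np.array([[a.real, -a.imag, b.real],
                         [a.imag,  a.real, b.imag]])

    # toy usage: mouth registration pixels at (10, 20)-(60, 20) must land on
    # head registration pixels at (112, 156)-(162, 161)
    M = registration_transform([(10, 20), (60, 20)], [(112, 156), (162, 161)])
    print(np.round(M, 3))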

[0084] The effect of the alignment is that the mouth is positioned, rotated, and stretched over the background head, so that the mouth appears in the correct size and orientation relative to the particular head image. One embodiment introduces slight variation in the position of the registration points in order to make the animation appear more realistic. For example and in one embodiment, adding or subtracting a random value within 5 pixels of the original location of the registration pixels will make the mouth appear to move slightly.

[0085] In another embodiment, volume data is used to scale the mouth shapes, resulting in a change in the distance between the registration pixels on the mouth shape. In one form of this embodiment, the resulting mouth shape is laterally stretched or compressed such that the registration pixels of the mouth shape resume their original distance and, therefore, align with the registration pixels of the head. Accordingly, a higher volume phoneme results in a wider-opened mouth shape, while a lower volume phoneme results in a more narrowly opened mouth shape. In another embodiment, the mouth shapes are scaled without affecting the distance between registration pixels. In yet another embodiment, volume data is used to choose one from a set of differently sized mouth shapes corresponding to a single phoneme.

[0086] After the registration pixels are aligned, the mouth shape is composited over the head to produce an animation frame comprising a head and a mouth shape as if they were drawn together. In one embodiment, the head and mouth shape can be further composited over a background. In another embodiment, the image frame includes both the head and the background.

[0087] As discussed above, this process is repeated for every time segment in the animation, as defined by the frame rate of the animation. Note that if several consecutive animation frames have the same mouth shape, the mouth shape may still need to be aligned, since the head may move as the sequence of image frames progresses.

[0088] 3. Adding Pre-Animated Scenes

[0089] Once the animated sequence having synchronized audio and visual characteristics is complete, one embodiment of animation compositing and display module 44 allows for the addition of pre-animated sequences to the beginning and/or end of the animated sequence created by the user. As discussed above, the entire sequence can then be transformed into a multimedia movie according to conventional techniques.

[0090] 4. Real-Time Playback

[0091] In another embodiment, the apparatus of the present invention includes a real-time playback mode. In one embodiment, this option allows the user to preview the resulting animation before saving it in a digital animation file format. In another embodiment, the real-time playback mode enables real-time animation of chatroom discussions or other communications over computer networks. (See Section E, infra.)

[0092] FIG. 6 illustrates a method providing a real-time playback mode for use in the present invention. According to this method, when playback of the audio data begins (FIG. 6, step 512), time controller/sampling module 42 detects or keeps track of the playback time (FIG. 6, step 516). In one embodiment, the delay, TD, in compositing and displaying an animation frame is added to the playback time (FIG. 6, step 518). This time value is used to retrieve the image frame corresponding to the playback time (step 520) and to sample the event sequence (FIG. 6, step 522) in order to retrieve the corresponding mouth shape. The image frame and the mouth shape are combined, as discussed above, to create an animation frame (FIG. 6, step 524), which is displayed to the user (step 526). This real-time loop is repeated for the duration of the event sequence (FIG. 6, step 514). In one embodiment, the user interface allows the user to stop the animation at any time during playback. Optionally and in one embodiment, the first animation frame is assembled and displayed before audio playback begins (see FIG. 6, steps 502, 504, 506, 508 and 510).
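The real-time loop of FIG. 6 might be sketched as follows; the average compositing delay TD, the event timings and the display callback are hypothetical values chosen so the example runs on its own, and a real implementation would read the playback clock from the audio subsystem rather than a wall-clock timer.

    import time

    # Hypothetical sketch of the FIG. 6 real-time loop: track the playback
    # clock, add the compositing/display delay TD, sample the event sequence
    # at that compensated time, and display the resulting frame.
    def realtime_playback(events, duration_ms, display, avg_delay_ms=15.0):
        start = time.monotonic()
        while True:
            playback_ms = (time.monotonic() - start) * 1000.0
            if playback_ms >= duration_ms:
                break
            t = playback_ms + avg_delay_ms          # compensate for TD
            mouth = events[0][1]                    # most recent event at or before t
            for et, m in events:
                if et <= t:
                    mouth = m
            display(t, mouth)                       # composite + show the frame
            time.sleep(1 / 24)                      # roughly one frame period

    events = [(0, 'm2'), (120, 'm11a'), (200, 'm11b'), (400, 'm5')]
    realtime_playback(events, 500, lambda t, m: print(f"{t:6.1f} ms -> {m}"))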

[0093] E. Application to Computer Network

[0094] As FIG. 7 shows, one embodiment of the present invention can be applied to a computer network. According to this embodiment, animation system 30 comprises communications server 32 operably connected to animation server 34 and computer network 60. Users at client computers 70 input and record audio data using microphones 72 and transmit the audio data to animation server 34 via web server 32. As with some of the embodiments described above, users select from one of a plurality of animated characters and background scenes to combine with the audio data. Animation server 34 then constructs an animated sequence as described above. According to one embodiment, users can preview the animation by using the real-time playback mode. After the animated sequence is generated, one embodiment allows the user the option to specify the digital animation format into which the animated sequence is to be converted. Users can download the resulting file to client computer 70 and/or transmit the animation file to others, for example, as an e-mail attachment.

[0095] In one embodiment, users input and record audio data employing existing audio facilities resident on client computer 70. Users then access animation system 30 and transmit the audio data. In another embodiment, users access animation system 30, which downloads a module that allows the user to input and record audio data and transmit the audio data to animation system 30. In one embodiment, such a module is a Java applet or other module. In another embodiment, the module comprises native code. In another embodiment, the recording module could be downloaded and installed as a plug-in to the browser on client computer 70.

[0096] The present invention can also be integrated into a chatroom environment. In one embodiment, users at client computers 70 connected to computer network 60 log into a chatroom by accessing a chatroom server, as is conventional. In one embodiment, the chatroom page displays one to a plurality of animated characters, each controlled by the audio data transmitted by a chatroom user. According to this embodiment, using microphone 72, a user enters audio data into client computer 70, which is recorded in digital form and sent to all other chatroom users as a WAV or other sound file. As discussed more fully below, a phonetic code sequence or an event sequence is also transmitted with the audio data. These data are then used to control the lip/mouth movements of the animated character corresponding to the user.

[0097] 1. Transmitting a Phonetic Code Sequence or Event Sequence

[0098] In one embodiment, a client computer 70 includes the functionality to generate a phonetic code sequence from the audio data. In one form of this embodiment, a phonetic code sequence module is downloaded to client computer 70 when the user logs into the chatroom. In another form, the module is installed as a client-side plug-in or downloaded as a Java applet. In any form of this embodiment, the phonetic code sequence module detects a phonetic code sequence in the audio data. In one embodiment, this phonetic code sequence is transmitted in connection with the audio data to the chatroom server, which transmits the data to other chatroom users (see below).

[0099] In another embodiment, an event sequencing module is also transmitted as a plug-in or applet to client computer 70 when the user logs into the chatroom. In this form, the event sequencing module generates an event sequence from the phonetic code sequence (see discussion, supra). Accordingly, the event sequence and audio data are transmitted to the chatroom server for transmission to other users. In another embodiment, the phonetic code sequence and event sequencing modules are stored permanently on client computer 70.
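One conceivable packet layout for carrying the audio data together with the phonetic code or event sequence is sketched below; the field names and framing are illustrative assumptions for the sketch, not a protocol defined by the invention.

    import json

    # Hypothetical sketch of a chat packet carrying audio plus the sequence
    # data that drives the recipient's animation.
    def make_packet(user_id, audio_bytes, event_sequence):
        header = {
            'user': user_id,
            'audio_len': len(audio_bytes),
            'events': event_sequence,           # e.g., [[0, 'm2'], [40, 'm11a']]
        }
        return json.dumps(header).encode('utf-8') + b'\n' + audio_bytes

    def parse_packet(packet):
        header, audio = packet.split(b'\n', 1)
        return json.loads(header), audio

    pkt = make_packet('alice', b'\x00\x01' * 100, [[0, 'm2'], [40, 'm11a']])
    meta, audio = parse_packet(pkt)
    print(meta['user'], meta['audio_len'], len(audio))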

[0100] In one embodiment, the chatroom server constructs an animation frame or animation sequence and transmits the animation frame and streaming audio data to other chatroom users. According to this embodiment, the chatroom server generates subsequent animation frames, according to the real-time playback mode discussed above (see FIG. 6), and transmits these animation frames to the chatroom users. Accordingly, users at client computers 70 hear the audio data and view an animation sequence synchronized with the audio data.

[0101] 2. Receiving Audio Data and Phonetic Code or Event Sequence

[0102] According to another embodiment, client computers include the functionality to receive a phonetic code or event sequence and construct an animation sequence in synchronization with the audio data.

[0103] In one form, client computers 70 each include the functionality necessary to composite and display animation frames, integrated with a computer network communications application, such as a browser or other communications device. In one form, client computers 70 include a mouth shape and image frame database, a time controller/sampling module and an animation compositing and display module (see above). According to this embodiment, the image frame database stores a sequence of image frames. In one form, since the duration of the chatroom session is unknown, the animation compositing module is configured to loop the sequence of image frames by starting at the first image frame in the sequence after it has reached the end. In any form, client computers display the animated chatroom characters and receive data packets, which control the mouth movements of the displayed characters. According to this embodiment, the recipient client computers 70 receive a packet of audio data and a corresponding phonetic code or event sequence from the chatroom server. In one embodiment, client computer 70 then plays back the audio data to the user and constructs and displays a series of animation frames from the received phonetic code or event sequence, according to the real-time animation method illustrated in FIG. 6. In one embodiment, client computer 70 is fed a series of data packets including audio data and phonetic code or event sequences.

[0104] 3. Direct Communication Between Users

[0105] Yet another embodiment of the present invention features real-time animation of a conversation between two users over a computer network. More particularly and in one embodiment, client computers 70 each display two animated characters, each representing a particular user. According to one embodiment, the respective users' computers include the functionality, described above, to record audio data, detect phonetic code sequences, generate event sequences, as well as composite and display a sequence of animated frames.

[0106] In one form of this embodiment, two users desiring to use the system access an animation control server on a computer network. The animation control server allows the users to select the animated characters they wish to display. When the users have selected the animated characters, the animation control server downloads to each computer a sequence of image frames and a set of mouth shapes for each animated character.

[0107] According to this embodiment, users speak into a microphone operably connected to their respective computers. As described above, a phonetic code sequence is detected in the resulting audio data, from which an event code sequence is generated. The audio data and the phonetic code or event sequence are transmitted as a packet to the intended recipient's computer.

[0108] In one embodiment, the recipient's computer receives the packet and plays back the audio data while executing the real-time animation loop described in FIG. 6. Accordingly, the lip or mouth animation of these animated characters is controlled by the phonetic code or event sequences detected in the audio data on the user's computer and transmitted to the recipient computer. In one embodiment, client computers exchange packets of audio data and phonetic code or event sequences over the computer network.

[0109] With respect to the above-provided description, one skilled in the art will readily recognize that the present invention has application in a variety of contexts. The foregoing description illustrates the principles of the present invention and provides examples of its implementation. For example, although certain embodiments are described as working in conjunction with an Internet browser, the present invention may be used in connection with any suitable software application for accessing files on a computer network. Moreover, one skilled in the art will readily recognize that the implementation details and parameters of operation can be widely varied to achieve many different objectives. Accordingly, the description is not intended to limit the scope of the claims to the exact embodiments shown and described.

1.-33. (cancelled)
34. A method for generating an animated sequence having synchronized visual and audio characteristics during the playback of audio data, said method executed in a computing device including a mouth shape database including a plurality of mouth shapes corresponding to events and an image frame database storing a plurality of image frames, at least one of said image frames including an animated character, said method comprising the steps of: (a) receiving audio data; (b) detecting a phonetic code sequence in said audio data; (c) generating an event sequence from said phonetic code sequence; and, during the playback of said audio data: (d) tracking the audio playback time; (e) sampling said event sequence using said playback time tracked in step (d); (f) constructing an animation frame based on an image frame selected from the image frame database and the mouth shape corresponding to the event sampled in said sampling step (e); (g) displaying the animation frame; and (h) repeating steps (e)-(g) a desired number of times.
35. The method of claim 34 wherein steps (e)-(g) are repeated for the duration of said audio data.
36. The method of claim 34 wherein said event sequence sampled in step (e) is sampled at a predetermined interval from said playback time tracked in said step (d).
37. The method of claim 34 further comprising (i) monitoring the delay associated with the constructing and displaying of animation frames in steps (f) and (g); and wherein the sampling step (e) is based on said playback time tracked in step (d) and said delay monitored in step (i).
 38.-45. (cancelled)
46. A method for driving a user interface displaying at least one animated character, said method comprising the steps of: (a) receiving at least one packet, said at least one packet comprising audio data and a phonetic code sequence; (b) generating an event sequence using said phonetic code sequence; (c) playing back said audio data; (d) tracking the audio playback time; (e) sampling said event sequence using said playback time tracked in step (d); (f) displaying an animation frame based on said sampling step (e); and (g) repeating steps (d)-(g) a desired number of times.
47. The method of claim 46 wherein steps (d)-(g) are repeated for the duration of said audio data.
48. The method of claim 46 wherein said event sequence sampled in step (e) is sampled at a predetermined interval from said playback time detected in said step (d).
49. The method of claim 46 further comprising (h) monitoring the delay associated with the constructing and displaying of animation frames in steps (f) and (g); and wherein the sampling step (e) is based on said playback time tracked in step (d) and said delay monitored in step (h).
50. A method for driving a user interface displaying at least one animated character, said method comprising the steps of: (a) receiving at least one packet, said at least one packet comprising audio data and an event sequence; (b) playing back said audio data; (c) tracking the audio playback time; (d) sampling said event sequence using said playback time tracked in step (c); (e) displaying an animation frame based on said sampling step (d); and (f) repeating steps (c)-(f) a desired number of times.
51. The method of claim 50 wherein steps (c)-(f) are repeated for the duration of said audio data.
52. The method of claim 50 wherein said event sequence sampled in step (d) is sampled at a predetermined interval from said playback time detected in said step (c).
53. The method of claim 50 further comprising (h) monitoring the delay associated with the constructing and displaying of animation frames in steps (e) and (f); and wherein the sampling step (d) is based on said playback time tracked in step (c) and said delay monitored in step (h).